1 Model training

This tutorial follows the vignettes written by Ben Schmidt to illustrate his wordVectors package; see the introductory and exploration vignettes, as well as his longer blog post on vector space models for the humanities.

This tutorial walks through training a model on our Nobel corpus, which is a very small corpus for word embedding. This is done more as a demonstration of the wordVectors package than as something likely to yield valuable insight. First we write our Nobel corpus to a text file in the current directory. Then we have wordVectors prep this file: tokenizing it, lowercasing everything (we’ve already done this, but the package doesn’t know that), and bundling commonly occurring bigrams into single tokens. Finally we train the model, which writes its vectors to another file whose name we must supply.

# install.packages("devtools")
# devtools::install_github("bmschmidt/wordVectors")
library(wordVectors)
library(magrittr)
library(tidyverse)

nobel <- read_rds("data/nobel_cleaned.Rds")
write_lines(nobel$AwardSpeech, "nobel.txt")

# Tokenize, lowercase, and bundle frequent bigrams into single tokens
prep_word2vec(origin = "nobel.txt", destination = "nobel_prep.txt",
              lowercase = TRUE, bundle_ngrams = 2)

# Train a 200-dimensional model; force = TRUE overwrites any existing
# nobel_vectors.bin
model <- train_word2vec("nobel_prep.txt", "nobel_vectors.bin",
                        vectors = 200, threads = 4, window = 10,
                        iter = 5, negative_samples = 10, force = TRUE)

2 Vector operations

With our model trained, the most obvious thing to do is to look at individual words and see which other words are closest to them in terms of cosine similarity.
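Cosine similarity measures only the angle between two vectors, ignoring their lengths. As a reminder of the calculation closest_to performs for each vocabulary word, here is a small base-R sketch with made-up three-dimensional vectors (not vectors from our model):

```r
# Cosine similarity: dot product divided by the product of the norms.
# The vectors below are toy examples, not embeddings from our model.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

v1 <- c(1, 2, 3)
v2 <- c(2, 4, 6)    # same direction as v1, twice the length
v3 <- c(-1, 0, 1)

cosine_sim(v1, v2)  # 1: identical direction, length is irrelevant
cosine_sim(v1, v3)  # roughly 0.38: different direction
```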

model %>% closest_to("peace", n = 15)
##              word similarity to "peace"
## 1           peace             1.0000000
## 2           noble             0.6221961
## 3      congresses             0.6208113
## 4    prerequisite             0.6186160
## 5       realities             0.6091034
## 6      fraternity             0.5951835
## 7         secured             0.5921423
## 8  reconciliation             0.5900457
## 9   understanding             0.5874524
## 10      bjørnson             0.5821669
## 11       enduring             0.5805418
## 12    foundations             0.5805184
## 13      promotion             0.5803314
## 14        genuine             0.5780679
## 15    advancement             0.5711615

closest_to also allows for easy vector addition and subtraction. We can try the classic (and perhaps a bit tired) analogy:

model %>% closest_to(~"king"+"woman"-"man")
##               word similarity to "king" + "woman" - "man"
## 1             king                              0.8287436
## 2    martin_luther                              0.6857282
## 3  andrei_sakharov                              0.6314907
## 4           carlos                              0.6060730
## 5   nelson_mandela                              0.6017250
## 6         visiting                              0.5819703
## 7       literature                              0.5804038
## 8          company                              0.5725947
## 9          clinton                              0.5610304
## 10          gentle                              0.5602383

Well, that didn’t work! But we shouldn’t really be surprised: we’re using a tiny corpus, and one not likely to talk much about kings or queens. More meaningful for this sort of corpus might be:

model %>% closest_to(~"nuclear" + "peace")
##                        word similarity to "nuclear" + "peace"
## 1                   nuclear                         0.8696372
## 2                     peace                         0.7575096
## 3                  test_ban                         0.7009185
## 4                preventing                         0.6819124
## 5                   obama's                         0.6798778
## 6                explosions                         0.6730134
## 7  international_physicians                         0.6665144
## 8                   testing                         0.6653571
## 9                    ican's                         0.6646791
## 10                     ican                         0.6543100
model %>% closest_to(~"nuclear" - "peace")
##               word similarity to "nuclear" - "peace"
## 1          nuclear                         0.7231207
## 2           atomic                         0.4170524
## 3       explosions                         0.4168727
## 4             test                         0.4154024
## 5  nuclear_weapons                         0.4098148
## 6             bomb                         0.4070871
## 7          testing                         0.4063369
## 8          warfare                         0.3809506
## 9            bombs                         0.3749656
## 10   nuclear_tests                         0.3655815
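These formula queries are just shorthand for vector arithmetic. Assuming the trained model object from above, the same kind of query can be written out by hand (a sketch using the model[[ ]] extraction that wordVectors provides):

```r
# Equivalent to closest_to(~ "nuclear" + "peace"): build the target
# vector explicitly, then rank words by cosine similarity to it.
target <- model[["nuclear"]] + model[["peace"]]
model %>% closest_to(target, n = 10)
```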

A rough approximation of Kozlowski, Taddy, and Evans (2019) might be to construct a “cultural” vector (we’ll just use one binary pair and take the difference, rather than averaging over multiple pairs) and then measure cosine similarity to other words – i.e., the extent to which they point in the direction of our vector (towards “peace”) or towards “violence” (which yields a negative number, the lower the more similar).

peace <- model[rownames(model) == "peace"]
violence <- model[rownames(model) == "violence"]
pv_spectrum <- peace - violence
cosineSimilarity(pv_spectrum, model[["treaty"]])
##           [,1]
## [1,] 0.2830068
cosineSimilarity(pv_spectrum, model[["armistice"]])
##            [,1]
## [1,] 0.08531787
cosineSimilarity(pv_spectrum, model[["violation"]])
##           [,1]
## [1,] -0.146485
cosineSimilarity(pv_spectrum, model[["aggression"]])
##            [,1]
## [1,] -0.3525665
cosineSimilarity(pv_spectrum, model[["war"]])
##            [,1]
## [1,] -0.1732886

All told, and for such a small corpus, this seems not half bad.
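Kozlowski, Taddy, and Evans actually stabilize such an axis by averaging the difference vectors of many antonym pairs. A hedged sketch of that extension follows; the extra pairs are illustrative choices of mine, not from the tutorial, and assume each word appears in the model’s vocabulary:

```r
# Average the difference vectors of several antonym pairs to build a
# sturdier peace-violence axis (pair choices are illustrative).
pairs <- list(c("peace", "violence"),
              c("harmony", "conflict"),
              c("reconciliation", "aggression"))
diffs <- lapply(pairs, function(p) model[[p[1]]] - model[[p[2]]])
pv_axis <- Reduce(`+`, diffs) / length(diffs)
cosineSimilarity(pv_axis, model[["treaty"]])
```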

3 Plotting

We might also try to plot this cultural axis using two pairs of binary opposites, and then see where other words land relative to the difference vector of each pair (similar to what we did above). To subset our total vocabulary, we’ll plot the 200 words most similar to “politics”.

violation <- model %>% 
  closest_to(~ "violation" - "rights", n = Inf) 
war <- model %>% 
  closest_to(~ "war" - "peace", n = Inf)
politics <- model %>%
  closest_to("politics", n = 200)

politics %>%
  inner_join(violation) %>%
  inner_join(war) %>%
  ggplot() + 
  geom_text(aes(x=`similarity to "violation" - "rights"`,
                y=`similarity to "war" - "peace"`,
                label=word))

wordVectors includes several nice plotting features. One is via principal component analysis (PCA), which reduces the original many-dimensional space to a smaller number of dimensions (here two), chosen as the most informative directions through that space. Here we’ll compare the words grouped around “peace”.

peacewords <- model %>% closest_to("peace", n = 50)
peace <- model[[peacewords$word, average = FALSE]]
plot(peace, method = "pca")

Or we can use t-SNE, another dimensionality-reduction method, to project our word vectors onto two-dimensional space.

plot(model, perplexity = 50)

This definitely shows words that tend to appear together. There are perhaps some interesting things here, though historians are likely to find graphs like this most interesting when compared over time.

Kozlowski, Austin C, Matt Taddy, and James A Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class Through Word Embeddings.” American Sociological Review 84 (5): 905–49.


2022.